This project is to explores a dataset, Loan Data from Prosper, provided by Udacity. It was downloaded using the link; https://www.google.com/url?q=https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv&sa=D&ust=1581581520570000 . This dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sb
%matplotlib inline
import warnings
warnings.simplefilter("ignore")
#Load data
df_loan = pd.read_csv('prosperLoanData.csv')
# set maximum number of columns to be displayed to number of columns in dataset
pd.set_option('display.max_columns', df_loan.shape[1])
#load loan data to visually access it
df_loan
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | CO | Other | Self-employed | 2.0 | True | True | NaN | 2007-08-26 18:41:46.780000000 | 640.0 | 659.0 | 2001-10-11 00:00:00 | 5.0 | 4.0 | 12.0 | 1 | 24.0 | 3.0 | 3.0 | 2.0 | 472.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.00 | 1500.0 | 11.0 | 0.81 | 0.0 | 0.17 | $25,000-49,999 | True | 3083.333333 | E33A3400205839220442E84 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 78 | 19141 | 9425 | 2007-09-12 00:00:00 | Q3 2007 | 1F3E3376408759268057EDA | 330.43 | 11396.1400 | 9425.00 | 1971.1400 | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | 0.07960 | 0.0249 | 0.05470 | 6.0 | A | 7.0 | 2 | CO | Professional | Employed | 44.0 | False | False | NaN | 2014-02-27 08:28:14 | 680.0 | 699.0 | 1996-03-18 00:00:00 | 14.0 | 14.0 | 29.0 | 13 | 389.0 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3989.0 | 0.21 | 10266.0 | 29.0 | 1.00 | 2.0 | 0.18 | $50,000-74,999 | True | 6125.000000 | 9E3B37071505919926B1D82 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 134815 | 10000 | 2014-03-03 00:00:00 | Q1 2014 | 1D13370546739025387B2F4 | 318.93 | 0.0000 | 0.00 | 0.0000 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | GA | Other | Not available | NaN | False | True | 783C3371218786870A73D20 | 2007-01-02 14:09:10.060000000 | 480.0 | 499.0 | 2002-07-27 00:00:00 | NaN | NaN | 3.0 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.06 | Not displayed | True | 2083.333333 | 6954337960046817851BCB2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 86 | 6466 | 3001 | 2007-01-17 00:00:00 | Q1 2007 | 5F7033715035555618FA612 | 123.32 | 4186.6300 | 3001.00 | 1185.6300 | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | 0.08490 | 0.0249 | 0.06000 | 6.0 | A | 9.0 | 16 | GA | Skilled Labor | Employed | 113.0 | True | False | NaN | 2012-10-22 11:02:32 | 800.0 | 819.0 | 1983-02-28 00:00:00 | 5.0 | 5.0 | 29.0 | 7 | 115.0 | 0.0 | 1.0 | 4.0 | 10056.0 | 14.0 | 0.0 | 0.0 | 1444.0 | 0.04 | 30754.0 | 26.0 | 0.76 | 0.0 | 0.15 | $25,000-49,999 | True | 2875.000000 | A0393664465886295619C51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 16 | 77296 | 10000 | 2012-11-01 00:00:00 | Q4 2012 | 9ADE356069835475068C6D2 | 321.45 | 5143.2000 | 4091.09 | 1052.1100 | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | 0.18316 | 0.0925 | 0.09066 | 3.0 | D | 4.0 | 2 | MN | Executive | Employed | 44.0 | True | False | NaN | 2013-09-14 18:38:44 | 680.0 | 699.0 | 2004-02-20 00:00:00 | 19.0 | 19.0 | 49.0 | 6 | 220.0 | 1.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6193.0 | 0.81 | 695.0 | 39.0 | 0.95 | 2.0 | 0.26 | $100,000+ | True | 9583.333333 | A180369302188889200689E | 1.0 | 11.0 | 11.0 | 0.0 | 0.0 | 11000.0 | 9947.90 | NaN | 0 | NaN | 6 | 102670 | 15000 | 2013-09-20 00:00:00 | Q3 2013 | 36CE356043264555721F06C | 563.97 | 2819.8500 | 1563.22 | 1256.6300 | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113932 | E6D9357655724827169606C | 753087 | 2013-04-14 05:55:02.663000000 | NaN | 36 | Current | NaN | 0.22354 | 0.1864 | 0.1764 | 0.16490 | 0.0699 | 0.09500 | 4.0 | C | 5.0 | 1 | IL | Food Service Management | Employed | 246.0 | True | False | NaN | 2013-04-14 05:54:58 | 700.0 | 719.0 | 1997-09-01 00:00:00 | 9.0 | 9.0 | 41.0 | 9 | 209.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 1.0 | 0.0 | 7714.0 | 0.80 | 1886.0 | 37.0 | 0.83 | 3.0 | 0.13 | $50,000-74,999 | True | 4333.333333 | 9BD7367919051593140DB62 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 11 | 88485 | 10000 | 2013-04-22 00:00:00 | Q2 2013 | 2EC435768441332602FDC15 | 364.74 | 3647.4000 | 2238.38 | 1409.0200 | -75.58 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 113933 | E6DB353036033497292EE43 | 537216 | 2011-11-03 20:42:55.333000000 | NaN | 36 | FinalPaymentInProgress | NaN | 0.13220 | 0.1110 | 0.1010 | 0.10070 | 0.0200 | 0.08070 | 6.0 | A | 8.0 | 7 | PA | Professional | Employed | 21.0 | True | False | NaN | 2011-11-03 20:42:53 | 700.0 | 719.0 | 1992-01-17 00:00:00 | 14.0 | 13.0 | 39.0 | 9 | 495.0 | 1.0 | 4.0 | 1.0 | 5062.0 | 4.0 | 0.0 | 0.0 | 15743.0 | 0.69 | 6658.0 | 39.0 | 0.92 | 0.0 | 0.11 | $75,000-99,999 | True | 8041.666667 | 62D93634569816897D5A276 | 3.0 | 60.0 | 60.0 | 0.0 | 0.0 | 33501.0 | 4815.42 | -26.0 | 0 | NaN | 28 | 55801 | 2000 | 2011-11-07 00:00:00 | Q4 2011 | 55C4336679182766893E4FC | 65.57 | 2330.5500 | 1997.16 | 333.3900 | -30.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 22 |
| 113934 | E6E13596170052029692BB1 | 1069178 | 2013-12-13 05:49:12.703000000 | NaN | 60 | Current | NaN | 0.23984 | 0.2150 | 0.2050 | 0.18828 | 0.1025 | 0.08578 | 3.0 | D | 3.0 | 1 | TX | Other | Employed | 84.0 | True | False | NaN | 2013-12-13 05:49:15 | 700.0 | 719.0 | 2002-02-25 00:00:00 | 12.0 | 12.0 | 25.0 | 9 | 521.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22147.0 | 0.73 | 7853.0 | 25.0 | 1.00 | 0.0 | 0.51 | $25,000-49,999 | True | 2875.000000 | DD1A370200396006300ACA0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 3 | 123122 | 10000 | 2013-12-23 00:00:00 | Q4 2013 | 0FE0370029359765342FDB5 | 273.35 | 546.7000 | 183.15 | 363.5500 | -16.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 119 |
| 113935 | E6EB3531504622671970D9E | 539056 | 2011-11-14 13:18:26.597000000 | NaN | 60 | Completed | 2013-08-13 00:00:00 | 0.28408 | 0.2605 | 0.2505 | 0.24450 | 0.0850 | 0.15950 | 4.0 | C | 5.0 | 2 | GA | Food Service | Full-time | 94.0 | True | False | NaN | 2011-11-14 13:18:24 | 680.0 | 699.0 | 1993-12-01 00:00:00 | 11.0 | 11.0 | 22.0 | 7 | 488.0 | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 11956.0 | 0.69 | 4137.0 | 19.0 | 0.80 | 1.0 | 0.48 | $25,000-49,999 | True | 3875.000000 | 589536350469116027ED11B | 1.0 | 16.0 | 16.0 | 0.0 | 0.0 | 5000.0 | 3264.37 | -4.0 | 0 | NaN | 28 | 56401 | 15000 | 2011-11-21 00:00:00 | Q4 2011 | A33834861822272782621C8 | 449.55 | 21122.5600 | 15000.00 | 6122.5600 | -235.05 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 274 |
| 113936 | E6ED3600409833199F711B7 | 1140093 | 2014-01-15 09:27:37.657000000 | NaN | 36 | Current | NaN | 0.13189 | 0.1039 | 0.0939 | 0.09071 | 0.0299 | 0.06081 | 6.0 | A | 7.0 | 1 | NY | Professor | Employed | 244.0 | False | False | NaN | 2014-01-15 09:27:40 | 680.0 | 699.0 | 1995-01-01 00:00:00 | 10.0 | 9.0 | 44.0 | 8 | 289.0 | 0.0 | 1.0 | 1.0 | 257.0 | 3.0 | 1.0 | 0.0 | 6166.0 | 0.80 | 675.0 | 36.0 | 0.75 | 0.0 | 0.23 | $50,000-74,999 | True | 4583.333333 | 00AF3704550953269A64E40 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 2 | 127508 | 2000 | 2014-01-21 00:00:00 | Q1 2014 | CE1E3704648000761C9F724 | 64.90 | 64.3307 | 47.25 | 17.0807 | -1.70 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
113937 rows × 81 columns
#Check top rows of the data
df_loan.head()
| ListingKey | ListingNumber | ListingCreationDate | CreditGrade | Term | LoanStatus | ClosedDate | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperRating (Alpha) | ProsperScore | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | EmploymentStatusDuration | IsBorrowerHomeowner | CurrentlyInGroup | GroupKey | DateCreditPulled | CreditScoreRangeLower | CreditScoreRangeUpper | FirstRecordedCreditLine | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | StatedMonthlyIncome | LoanKey | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | LoanOriginationDate | LoanOriginationQuarter | MemberKey | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1021339766868145413AB3B | 193129 | 2007-08-26 19:09:29.263000000 | C | 36 | Completed | 2009-08-14 00:00:00 | 0.16516 | 0.1580 | 0.1380 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | CO | Other | Self-employed | 2.0 | True | True | NaN | 2007-08-26 18:41:46.780000000 | 640.0 | 659.0 | 2001-10-11 00:00:00 | 5.0 | 4.0 | 12.0 | 1 | 24.0 | 3.0 | 3.0 | 2.0 | 472.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.00 | 1500.0 | 11.0 | 0.81 | 0.0 | 0.17 | $25,000-49,999 | True | 3083.333333 | E33A3400205839220442E84 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 78 | 19141 | 9425 | 2007-09-12 00:00:00 | Q3 2007 | 1F3E3376408759268057EDA | 330.43 | 11396.14 | 9425.00 | 1971.14 | -133.18 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 258 |
| 1 | 10273602499503308B223C1 | 1209647 | 2014-02-27 08:28:07.900000000 | NaN | 36 | Current | NaN | 0.12016 | 0.0920 | 0.0820 | 0.07960 | 0.0249 | 0.05470 | 6.0 | A | 7.0 | 2 | CO | Professional | Employed | 44.0 | False | False | NaN | 2014-02-27 08:28:14 | 680.0 | 699.0 | 1996-03-18 00:00:00 | 14.0 | 14.0 | 29.0 | 13 | 389.0 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3989.0 | 0.21 | 10266.0 | 29.0 | 1.00 | 2.0 | 0.18 | $50,000-74,999 | True | 6125.000000 | 9E3B37071505919926B1D82 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 0 | 134815 | 10000 | 2014-03-03 00:00:00 | Q1 2014 | 1D13370546739025387B2F4 | 318.93 | 0.00 | 0.00 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 1 |
| 2 | 0EE9337825851032864889A | 81716 | 2007-01-05 15:00:47.090000000 | HR | 36 | Completed | 2009-12-17 00:00:00 | 0.28269 | 0.2750 | 0.2400 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | GA | Other | Not available | NaN | False | True | 783C3371218786870A73D20 | 2007-01-02 14:09:10.060000000 | 480.0 | 499.0 | 2002-07-27 00:00:00 | NaN | NaN | 3.0 | 0 | 0.0 | 0.0 | 1.0 | 1.0 | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.06 | Not displayed | True | 2083.333333 | 6954337960046817851BCB2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 86 | 6466 | 3001 | 2007-01-17 00:00:00 | Q1 2007 | 5F7033715035555618FA612 | 123.32 | 4186.63 | 3001.00 | 1185.63 | -24.20 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 41 |
| 3 | 0EF5356002482715299901A | 658116 | 2012-10-22 11:02:35.010000000 | NaN | 36 | Current | NaN | 0.12528 | 0.0974 | 0.0874 | 0.08490 | 0.0249 | 0.06000 | 6.0 | A | 9.0 | 16 | GA | Skilled Labor | Employed | 113.0 | True | False | NaN | 2012-10-22 11:02:32 | 800.0 | 819.0 | 1983-02-28 00:00:00 | 5.0 | 5.0 | 29.0 | 7 | 115.0 | 0.0 | 1.0 | 4.0 | 10056.0 | 14.0 | 0.0 | 0.0 | 1444.0 | 0.04 | 30754.0 | 26.0 | 0.76 | 0.0 | 0.15 | $25,000-49,999 | True | 2875.000000 | A0393664465886295619C51 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 | NaN | 16 | 77296 | 10000 | 2012-11-01 00:00:00 | Q4 2012 | 9ADE356069835475068C6D2 | 321.45 | 5143.20 | 4091.09 | 1052.11 | -108.01 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 158 |
| 4 | 0F023589499656230C5E3E2 | 909464 | 2013-09-14 18:38:39.097000000 | NaN | 36 | Current | NaN | 0.24614 | 0.2085 | 0.1985 | 0.18316 | 0.0925 | 0.09066 | 3.0 | D | 4.0 | 2 | MN | Executive | Employed | 44.0 | True | False | NaN | 2013-09-14 18:38:44 | 680.0 | 699.0 | 2004-02-20 00:00:00 | 19.0 | 19.0 | 49.0 | 6 | 220.0 | 1.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6193.0 | 0.81 | 695.0 | 39.0 | 0.95 | 2.0 | 0.26 | $100,000+ | True | 9583.333333 | A180369302188889200689E | 1.0 | 11.0 | 11.0 | 0.0 | 0.0 | 11000.0 | 9947.9 | NaN | 0 | NaN | 6 | 102670 | 15000 | 2013-09-20 00:00:00 | Q3 2013 | 36CE356043264555721F06C | 563.97 | 2819.85 | 1563.22 | 1256.63 | -60.27 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 0 | 0.0 | 20 |
#Check number of rows and columns()
df_loan.shape
(113937, 81)
#Check data characteristics
df_loan.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 113937 entries, 0 to 113936 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ListingKey 113937 non-null object 1 ListingNumber 113937 non-null int64 2 ListingCreationDate 113937 non-null object 3 CreditGrade 28953 non-null object 4 Term 113937 non-null int64 5 LoanStatus 113937 non-null object 6 ClosedDate 55089 non-null object 7 BorrowerAPR 113912 non-null float64 8 BorrowerRate 113937 non-null float64 9 LenderYield 113937 non-null float64 10 EstimatedEffectiveYield 84853 non-null float64 11 EstimatedLoss 84853 non-null float64 12 EstimatedReturn 84853 non-null float64 13 ProsperRating (numeric) 84853 non-null float64 14 ProsperRating (Alpha) 84853 non-null object 15 ProsperScore 84853 non-null float64 16 ListingCategory (numeric) 113937 non-null int64 17 BorrowerState 108422 non-null object 18 Occupation 110349 non-null object 19 EmploymentStatus 111682 non-null object 20 EmploymentStatusDuration 106312 non-null float64 21 IsBorrowerHomeowner 113937 non-null bool 22 CurrentlyInGroup 113937 non-null bool 23 GroupKey 13341 non-null object 24 DateCreditPulled 113937 non-null object 25 CreditScoreRangeLower 113346 non-null float64 26 CreditScoreRangeUpper 113346 non-null float64 27 FirstRecordedCreditLine 113240 non-null object 28 CurrentCreditLines 106333 non-null float64 29 OpenCreditLines 106333 non-null float64 30 TotalCreditLinespast7years 113240 non-null float64 31 OpenRevolvingAccounts 113937 non-null int64 32 OpenRevolvingMonthlyPayment 113937 non-null float64 33 InquiriesLast6Months 113240 non-null float64 34 TotalInquiries 112778 non-null float64 35 CurrentDelinquencies 113240 non-null float64 36 AmountDelinquent 106315 non-null float64 37 DelinquenciesLast7Years 112947 non-null float64 38 PublicRecordsLast10Years 113240 non-null float64 39 PublicRecordsLast12Months 106333 non-null float64 40 RevolvingCreditBalance 106333 non-null float64 41 BankcardUtilization 106333 non-null float64 42 AvailableBankcardCredit 106393 non-null float64 43 TotalTrades 106393 non-null float64 44 TradesNeverDelinquent (percentage) 106393 non-null float64 45 TradesOpenedLast6Months 106393 non-null float64 46 DebtToIncomeRatio 105383 non-null float64 47 IncomeRange 113937 non-null object 48 IncomeVerifiable 113937 non-null bool 49 StatedMonthlyIncome 113937 non-null float64 50 LoanKey 113937 non-null object 51 TotalProsperLoans 22085 non-null float64 52 TotalProsperPaymentsBilled 22085 non-null float64 53 OnTimeProsperPayments 22085 non-null float64 54 ProsperPaymentsLessThanOneMonthLate 22085 non-null float64 55 ProsperPaymentsOneMonthPlusLate 22085 non-null float64 56 ProsperPrincipalBorrowed 22085 non-null float64 57 ProsperPrincipalOutstanding 22085 non-null float64 58 ScorexChangeAtTimeOfListing 18928 non-null float64 59 LoanCurrentDaysDelinquent 113937 non-null int64 60 LoanFirstDefaultedCycleNumber 16952 non-null float64 61 LoanMonthsSinceOrigination 113937 non-null int64 62 LoanNumber 113937 non-null int64 63 LoanOriginalAmount 113937 non-null int64 64 LoanOriginationDate 113937 non-null object 65 LoanOriginationQuarter 113937 non-null object 66 MemberKey 113937 non-null object 67 MonthlyLoanPayment 113937 non-null float64 68 LP_CustomerPayments 113937 non-null float64 69 LP_CustomerPrincipalPayments 113937 non-null float64 70 LP_InterestandFees 113937 non-null float64 71 LP_ServiceFees 113937 non-null float64 72 LP_CollectionFees 113937 non-null float64 73 LP_GrossPrincipalLoss 113937 non-null float64 74 LP_NetPrincipalLoss 113937 non-null float64 75 LP_NonPrincipalRecoverypayments 113937 non-null float64 76 PercentFunded 113937 non-null float64 77 Recommendations 113937 non-null int64 78 InvestmentFromFriendsCount 113937 non-null int64 79 InvestmentFromFriendsAmount 113937 non-null float64 80 Investors 113937 non-null int64 dtypes: bool(3), float64(50), int64(11), object(17) memory usage: 68.1+ MB
#Check data description
df_loan.describe()
| ListingNumber | Term | BorrowerAPR | BorrowerRate | LenderYield | EstimatedEffectiveYield | EstimatedLoss | EstimatedReturn | ProsperRating (numeric) | ProsperScore | ListingCategory (numeric) | EmploymentStatusDuration | CreditScoreRangeLower | CreditScoreRangeUpper | CurrentCreditLines | OpenCreditLines | TotalCreditLinespast7years | OpenRevolvingAccounts | OpenRevolvingMonthlyPayment | InquiriesLast6Months | TotalInquiries | CurrentDelinquencies | AmountDelinquent | DelinquenciesLast7Years | PublicRecordsLast10Years | PublicRecordsLast12Months | RevolvingCreditBalance | BankcardUtilization | AvailableBankcardCredit | TotalTrades | TradesNeverDelinquent (percentage) | TradesOpenedLast6Months | DebtToIncomeRatio | StatedMonthlyIncome | TotalProsperLoans | TotalProsperPaymentsBilled | OnTimeProsperPayments | ProsperPaymentsLessThanOneMonthLate | ProsperPaymentsOneMonthPlusLate | ProsperPrincipalBorrowed | ProsperPrincipalOutstanding | ScorexChangeAtTimeOfListing | LoanCurrentDaysDelinquent | LoanFirstDefaultedCycleNumber | LoanMonthsSinceOrigination | LoanNumber | LoanOriginalAmount | MonthlyLoanPayment | LP_CustomerPayments | LP_CustomerPrincipalPayments | LP_InterestandFees | LP_ServiceFees | LP_CollectionFees | LP_GrossPrincipalLoss | LP_NetPrincipalLoss | LP_NonPrincipalRecoverypayments | PercentFunded | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.139370e+05 | 113937.000000 | 113912.000000 | 113937.000000 | 113937.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 84853.000000 | 113937.000000 | 106312.000000 | 113346.000000 | 113346.000000 | 106333.000000 | 106333.000000 | 113240.000000 | 113937.00000 | 113937.000000 | 113240.000000 | 112778.000000 | 113240.000000 | 106315.000000 | 112947.000000 | 113240.000000 | 106333.000000 | 1.063330e+05 | 106333.000000 | 106393.000000 | 106393.000000 | 106393.000000 | 106393.000000 | 105383.000000 | 1.139370e+05 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 22085.000000 | 18928.000000 | 113937.000000 | 16952.000000 | 113937.000000 | 113937.000000 | 113937.00000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 | 113937.000000 |
| mean | 6.278857e+05 | 40.830248 | 0.218828 | 0.192764 | 0.182701 | 0.168661 | 0.080306 | 0.096068 | 4.072243 | 5.950067 | 2.774209 | 96.071582 | 685.567731 | 704.567731 | 10.317192 | 9.260164 | 26.754539 | 6.96979 | 398.292161 | 1.435085 | 5.584405 | 0.592052 | 984.507059 | 4.154984 | 0.312646 | 0.015094 | 1.759871e+04 | 0.561309 | 11210.225447 | 23.230034 | 0.885897 | 0.802327 | 0.275947 | 5.608026e+03 | 1.421100 | 22.934345 | 22.271949 | 0.613629 | 0.048540 | 8472.311961 | 2930.313906 | -3.223214 | 152.816539 | 16.268464 | 31.896882 | 69444.474271 | 8337.01385 | 272.475783 | 4183.079489 | 3105.536588 | 1077.542901 | -54.725641 | -14.242698 | 700.446342 | 681.420499 | 25.142686 | 0.998584 | 0.048027 | 0.023460 | 16.550751 | 80.475228 |
| std | 3.280762e+05 | 10.436212 | 0.080364 | 0.074818 | 0.074516 | 0.068467 | 0.046764 | 0.030403 | 1.673227 | 2.376501 | 3.996797 | 94.480605 | 66.458275 | 66.458275 | 5.457866 | 5.022644 | 13.637871 | 4.63097 | 447.159711 | 2.437507 | 6.429946 | 1.978707 | 7158.270157 | 10.160216 | 0.727868 | 0.154092 | 3.293640e+04 | 0.317918 | 19818.361309 | 11.871311 | 0.148179 | 1.097637 | 0.551759 | 7.478497e+03 | 0.764042 | 19.249584 | 18.830425 | 2.446827 | 0.556285 | 7395.507650 | 3806.635075 | 50.063567 | 466.320254 | 9.005898 | 29.974184 | 38930.479610 | 6245.80058 | 192.697812 | 4790.907234 | 4069.527670 | 1183.414168 | 60.675425 | 109.232758 | 2388.513831 | 2357.167068 | 275.657937 | 0.017919 | 0.332353 | 0.232412 | 294.545422 | 103.239020 |
| min | 4.000000e+00 | 12.000000 | 0.006530 | 0.000000 | -0.010000 | -0.182700 | 0.004900 | -0.182700 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 19.000000 | 0.000000 | 0.000000 | 2.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -209.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1000.00000 | 0.000000 | -2.349900 | 0.000000 | -2.349900 | -664.870000 | -9274.750000 | -94.200000 | -954.550000 | 0.000000 | 0.700000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 4.009190e+05 | 36.000000 | 0.156290 | 0.134000 | 0.124200 | 0.115670 | 0.042400 | 0.074080 | 3.000000 | 4.000000 | 1.000000 | 26.000000 | 660.000000 | 679.000000 | 7.000000 | 6.000000 | 17.000000 | 4.00000 | 114.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.121000e+03 | 0.310000 | 880.000000 | 15.000000 | 0.820000 | 0.000000 | 0.140000 | 3.200333e+03 | 1.000000 | 9.000000 | 9.000000 | 0.000000 | 0.000000 | 3500.000000 | 0.000000 | -35.000000 | 0.000000 | 9.000000 | 6.000000 | 37332.000000 | 4000.00000 | 131.620000 | 1005.760000 | 500.890000 | 274.870000 | -73.180000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 50% | 6.005540e+05 | 36.000000 | 0.209760 | 0.184000 | 0.173000 | 0.161500 | 0.072400 | 0.091700 | 4.000000 | 6.000000 | 1.000000 | 67.000000 | 680.000000 | 699.000000 | 10.000000 | 9.000000 | 25.000000 | 6.00000 | 271.000000 | 1.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.549000e+03 | 0.600000 | 4100.000000 | 22.000000 | 0.940000 | 0.000000 | 0.220000 | 4.666667e+03 | 1.000000 | 16.000000 | 15.000000 | 0.000000 | 0.000000 | 6000.000000 | 1626.550000 | -3.000000 | 0.000000 | 14.000000 | 21.000000 | 68599.000000 | 6500.00000 | 217.740000 | 2583.830000 | 1587.500000 | 700.840100 | -34.440000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 44.000000 |
| 75% | 8.926340e+05 | 36.000000 | 0.283810 | 0.250000 | 0.240000 | 0.224300 | 0.112000 | 0.116600 | 5.000000 | 8.000000 | 3.000000 | 137.000000 | 720.000000 | 739.000000 | 13.000000 | 12.000000 | 35.000000 | 9.00000 | 525.000000 | 2.000000 | 7.000000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 0.000000 | 1.952100e+04 | 0.840000 | 13180.000000 | 30.000000 | 1.000000 | 1.000000 | 0.320000 | 6.825000e+03 | 2.000000 | 33.000000 | 32.000000 | 0.000000 | 0.000000 | 11000.000000 | 4126.720000 | 25.000000 | 0.000000 | 22.000000 | 65.000000 | 101901.000000 | 12000.00000 | 371.580000 | 5548.400000 | 4000.000000 | 1458.540000 | -13.920000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 115.000000 |
| max | 1.255725e+06 | 60.000000 | 0.512290 | 0.497500 | 0.492500 | 0.319900 | 0.366000 | 0.283700 | 7.000000 | 11.000000 | 20.000000 | 755.000000 | 880.000000 | 899.000000 | 59.000000 | 54.000000 | 136.000000 | 51.00000 | 14985.000000 | 105.000000 | 379.000000 | 83.000000 | 463881.000000 | 99.000000 | 38.000000 | 20.000000 | 1.435667e+06 | 5.950000 | 646285.000000 | 126.000000 | 1.000000 | 20.000000 | 10.010000 | 1.750003e+06 | 8.000000 | 141.000000 | 141.000000 | 42.000000 | 21.000000 | 72499.000000 | 23450.950000 | 286.000000 | 2704.000000 | 44.000000 | 100.000000 | 136486.000000 | 35000.00000 | 2251.510000 | 40702.390000 | 35000.000000 | 15617.030000 | 32.060000 | 0.000000 | 25000.000000 | 25000.000000 | 21117.900000 | 1.012500 | 39.000000 | 33.000000 | 25000.000000 | 1189.000000 |
#Check datatypes
df_loan.dtypes
ListingKey object
ListingNumber int64
ListingCreationDate object
CreditGrade object
Term int64
...
PercentFunded float64
Recommendations int64
InvestmentFromFriendsCount int64
InvestmentFromFriendsAmount float64
Investors int64
Length: 81, dtype: object
#Check for null values
df_loan.isnull().sum()
ListingKey 0
ListingNumber 0
ListingCreationDate 0
CreditGrade 84984
Term 0
...
PercentFunded 0
Recommendations 0
InvestmentFromFriendsCount 0
InvestmentFromFriendsAmount 0
Investors 0
Length: 81, dtype: int64
#Check for duplicates
df_loan.duplicated().sum()
0
# Visualise the relative proportions of each datatype to other datatypes
datatype = df_loan.dtypes.to_frame()
datatype.columns = ['dtype']
datatype['dtype'] = datatype['dtype'].astype('category')
datatype['dtype'] = datatype['dtype'].apply(lambda x: str(x))
datatype.T
# Calculate the relative proportions
no_of_obsevation = datatype['dtype'].shape[0]
max_proportion = datatype['dtype'].value_counts().to_list()[0]/no_of_obsevation
tick_proportions = np.arange(0, max_proportion, 0.03)
tick_names = ('{:0.2f}'.format(v) for v in tick_proportions)
plt.figure(figsize =(6,6))
order = datatype['dtype'].value_counts().index.to_list()
sb.countplot(x = "dtype" , data =datatype, order = order, color =sb.color_palette()[4],)
plt.xlabel('data types', fontdict = {'weight': 'bold'})
plt.ylabel('count of data types', fontdict = {'weight': 'bold'})
plt.yticks(tick_proportions * no_of_obsevation, tick_names);
#Calculate and display the percentage of each datatype
for i in range(datatype['dtype'].value_counts().shape[0]):
count = datatype['dtype'].value_counts().to_list()[i]
pct_string = '{:0.1f}%'.format(100*(count/no_of_obsevation))
plt.text(i,count+1, pct_string, ha = 'center');
# make a copy of original data to clean
df_loan_copy =df_loan.copy()
Define
Code
df_loan_copy['ProsperRating (Alpha)'] = df_loan_copy[['CreditGrade', 'ProsperRating (Alpha)']]\
['ProsperRating (Alpha)'].fillna(df_loan_copy['CreditGrade'])
df_loan_copy['CreditGrade'] = df_loan_copy[['CreditGrade', 'ProsperRating (Alpha)']]['CreditGrade']\
.fillna(df_loan_copy['ProsperRating (Alpha)'])
Test
df_loan_copy[['CreditGrade', 'ProsperRating (Alpha)']].head(5)
| CreditGrade | ProsperRating (Alpha) | |
|---|---|---|
| 0 | C | C |
| 1 | A | A |
| 2 | HR | HR |
| 3 | A | A |
| 4 | D | D |
Define
Code
df_loan_copy['LoanOriginationQuarter'] = df_loan_copy['LoanOriginationQuarter'].apply(lambda x: x.split(' ')[0])
Test
df_loan_copy['LoanOriginationQuarter'][1]
'Q1'
Define
Code
# ordinal categorical
ordinal_categorical = {'ProsperRating (Alpha)':['AA','A','B','C','D','E','HR'],
'CreditGrade': ['AA','A','B','C','D','E','HR'],
'IncomeRange': ['$0', '$1 - 24,999', '$25,000 - 49,999', '$50,000 - 74,999',\
'$75,000 - 99,999', '$100,000+', 'Not employed', 'Not displayed'],
'LoanOriginationQuarter': ['Q1', 'Q2', 'Q3', 'Q4']
}
#nominal categorical
nominal_categorical = ['EmploymentStatus','LoanStatus', 'BorrowerState', 'Occupation']
Define
Code
for var in ordinal_categorical:
ordered_var = pd.api.types.CategoricalDtype(ordered = True,
categories = ordinal_categorical[var])
df_loan_copy[var] = df_loan_copy[var].astype(ordered_var)
for var in nominal_categorical:
df_loan_copy[var] = df_loan_copy[var].astype('category')
Test
print(df_loan_copy[ordinal_categorical].dtypes, end = '\n\nnominal data \n')
print(df_loan_copy[nominal_categorical].dtypes)
ProsperRating (Alpha) category CreditGrade category IncomeRange category LoanOriginationQuarter category dtype: object nominal data EmploymentStatus category LoanStatus category BorrowerState category Occupation category dtype: object
Define
Code
df_loan_copy = df_loan_copy[['LoanStatus', 'LenderYield', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', \
'BorrowerState', 'Occupation', 'EmploymentStatus', 'IsBorrowerHomeowner', \
'CurrentlyInGroup', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', \
'TotalCreditLinespast7years', 'DebtToIncomeRatio', 'IncomeRange', \
'IncomeVerifiable', 'LoanOriginationQuarter', 'MonthlyLoanPayment', \
'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',\
'Investors', 'PercentFunded']]
Test
print(df_loan_copy.columns.to_list())
df_loan_copy.head()
['LoanStatus', 'LenderYield', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'TotalCreditLinespast7years', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'LoanOriginationQuarter', 'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors', 'PercentFunded']
| LoanStatus | LenderYield | ProsperRating (Alpha) | ListingCategory (numeric) | BorrowerState | Occupation | EmploymentStatus | IsBorrowerHomeowner | CurrentlyInGroup | CreditScoreRangeLower | CreditScoreRangeUpper | TotalCreditLinespast7years | DebtToIncomeRatio | IncomeRange | IncomeVerifiable | LoanOriginationQuarter | MonthlyLoanPayment | Recommendations | InvestmentFromFriendsCount | InvestmentFromFriendsAmount | Investors | PercentFunded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Completed | 0.1380 | C | 0 | CO | Other | Self-employed | True | True | 640.0 | 659.0 | 12.0 | 0.17 | NaN | True | Q3 | 330.43 | 0 | 0 | 0.0 | 258 | 1.0 |
| 1 | Current | 0.0820 | A | 2 | CO | Professional | Employed | False | False | 680.0 | 699.0 | 29.0 | 0.18 | NaN | True | Q1 | 318.93 | 0 | 0 | 0.0 | 1 | 1.0 |
| 2 | Completed | 0.2400 | HR | 0 | GA | Other | Not available | False | True | 480.0 | 499.0 | 3.0 | 0.06 | Not displayed | True | Q1 | 123.32 | 0 | 0 | 0.0 | 41 | 1.0 |
| 3 | Current | 0.0874 | A | 16 | GA | Skilled Labor | Employed | True | False | 800.0 | 819.0 | 29.0 | 0.15 | NaN | True | Q4 | 321.45 | 0 | 0 | 0.0 | 158 | 1.0 |
| 4 | Current | 0.1985 | D | 2 | MN | Executive | Employed | True | False | 680.0 | 699.0 | 49.0 | 0.26 | $100,000+ | True | Q3 | 563.97 | 0 | 0 | 0.0 | 20 | 1.0 |
Define
Code
df_loan_copy = df_loan_copy.dropna(subset = ['EmploymentStatus'])
df_loan_copy = df_loan_copy.dropna(subset = ['TotalCreditLinespast7years'])
df_loan_copy = df_loan_copy.dropna(subset = ['ProsperRating (Alpha)'])
Test
#Check for null values
df_loan_copy.isnull().sum()
LoanStatus 0 LenderYield 0 ProsperRating (Alpha) 0 ListingCategory (numeric) 0 BorrowerState 4084 Occupation 1333 EmploymentStatus 0 IsBorrowerHomeowner 0 CurrentlyInGroup 0 CreditScoreRangeLower 0 CreditScoreRangeUpper 0 TotalCreditLinespast7years 0 DebtToIncomeRatio 8427 IncomeRange 87303 IncomeVerifiable 0 LoanOriginationQuarter 0 MonthlyLoanPayment 0 Recommendations 0 InvestmentFromFriendsCount 0 InvestmentFromFriendsAmount 0 Investors 0 PercentFunded 0 dtype: int64
Define
Code
df_loan_copy.drop(['IncomeRange'], axis =1, inplace = True)
Test
df_loan_copy.isnull().sum()
LoanStatus 0 LenderYield 0 ProsperRating (Alpha) 0 ListingCategory (numeric) 0 BorrowerState 4084 Occupation 1333 EmploymentStatus 0 IsBorrowerHomeowner 0 CurrentlyInGroup 0 CreditScoreRangeLower 0 CreditScoreRangeUpper 0 TotalCreditLinespast7years 0 DebtToIncomeRatio 8427 IncomeVerifiable 0 LoanOriginationQuarter 0 MonthlyLoanPayment 0 Recommendations 0 InvestmentFromFriendsCount 0 InvestmentFromFriendsAmount 0 Investors 0 PercentFunded 0 dtype: int64
Define
Code
df_loan_copy['Occupation'] = df_loan_copy['Occupation'].cat.add_categories('Not Stated')
df_loan_copy['Occupation'].fillna('Not Stated', inplace = True)
df_loan_copy['DebtToIncomeRatio'] = df_loan_copy['DebtToIncomeRatio'].interpolate()
df_loan_copy['BorrowerState'] = df_loan_copy['BorrowerState'].cat.add_categories('Not Stated')
df_loan_copy['BorrowerState'].fillna('Not Stated', inplace = True)
Test
df_loan_copy.isnull().sum()
LoanStatus 0 LenderYield 0 ProsperRating (Alpha) 0 ListingCategory (numeric) 0 BorrowerState 0 Occupation 0 EmploymentStatus 0 IsBorrowerHomeowner 0 CurrentlyInGroup 0 CreditScoreRangeLower 0 CreditScoreRangeUpper 0 TotalCreditLinespast7years 0 DebtToIncomeRatio 0 IncomeVerifiable 0 LoanOriginationQuarter 0 MonthlyLoanPayment 0 Recommendations 0 InvestmentFromFriendsCount 0 InvestmentFromFriendsAmount 0 Investors 0 PercentFunded 0 dtype: int64
Define
Code
df_loan_copy['ListingCategory (numeric)'] = df_loan_copy['ListingCategory (numeric)'].replace(\
{0: 'Not AVilable', 1: 'Debt Consolidation', 2: 'HOme Improvement',3: 'Business', 4: 'Personal Loan',\
5: 'Student Use', 6: 'Auto', 7: 'Other', 8: 'Baby & Adoption', 9: 'Boat', 10: 'Cosmetic Procedures',\
11: 'Engagement Ring', 12: 'Green Loans', 13: 'Household Expenses', 14: 'Large Purchases',\
15: 'Medical/ Dental', 16: 'Motorcycle', 17: 'RV', 18: 'Taxes', 19: 'Vacation', 20: 'Wedding Loans'})
df_loan_copy['ListingCategory (numeric)'] = df_loan_copy['ListingCategory (numeric)'].astype('category')
Test
print(df_loan_copy['ListingCategory (numeric)'].dtype)
df_loan_copy['ListingCategory (numeric)'].value_counts()
category
Debt Consolidation 58242 Not AVilable 14572 Other 10469 HOme Improvement 7421 Business 7172 Auto 2564 Personal Loan 2392 Household Expenses 1996 Medical/ Dental 1522 Taxes 885 Large Purchases 876 Wedding Loans 771 Vacation 768 Student Use 750 Motorcycle 304 Engagement Ring 217 Baby & Adoption 199 Cosmetic Procedures 91 Boat 85 Green Loans 59 RV 52 Name: ListingCategory (numeric), dtype: int64
Define
Code
df_loan_copy.rename(columns = {'ListingCategory (numeric)': 'ListingCategory'}, inplace = True)
Test
df_loan_copy.columns
Index(['LoanStatus', 'LenderYield', 'ProsperRating (Alpha)', 'ListingCategory',
'BorrowerState', 'Occupation', 'EmploymentStatus',
'IsBorrowerHomeowner', 'CurrentlyInGroup', 'CreditScoreRangeLower',
'CreditScoreRangeUpper', 'TotalCreditLinespast7years',
'DebtToIncomeRatio', 'IncomeVerifiable', 'LoanOriginationQuarter',
'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount',
'InvestmentFromFriendsAmount', 'Investors', 'PercentFunded'],
dtype='object')
Store the cleaned Loan data as a csv file
df_loan_copy.to_csv('cleaned_Loan_prosper_data.csv')
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
I am interested in figuring out what features gives indepth insights related to the borrowers and the loans they took.
Listed Below are the features i will like to check.
#Top 10 states by distribution of borrowers
plt.figure(figsize = [12, 8])
color_pal = sb.color_palette()[5]
order = df_loan_copy['BorrowerState'].value_counts().iloc[:10].index
sb.countplot(data= df_loan_copy, y= 'BorrowerState', color = color_pal, order = order)
plt.title("Distribution of Borrowers in Top States",fontdict = {'weight': 'bold', 'fontsize':16})
plt.xlabel("Frequency", fontdict = {'weight': 'bold', 'fontsize':16})
plt.ylabel("States", fontdict = {'weight': 'bold', 'fontsize':16})
Text(0, 0.5, 'States')
#Distribution of the borrowers across the EmploymentStatus categories
plt.figure(figsize = [12, 8])
color_pal = sb.color_palette()[5]
order_type = df_loan_copy['EmploymentStatus'].value_counts().index
sb.countplot(data= df_loan_copy, y= 'EmploymentStatus', color = color_pal, order = order_type)
plt.title("Distribution of Borrowers across Employment Status",fontdict = {'weight': 'bold', 'fontsize':16})
plt.xlabel("Frequency",fontdict = {'weight': 'bold', 'fontsize':14})
plt.ylabel("Emloyment status",fontdict = {'weight': 'bold', 'fontsize':14});
#Occupation of borrowers
plt.figure(figsize = [12, 8])
color_pal = sb.color_palette()[5]
order = df_loan_copy['Occupation'].value_counts().iloc[:10].index
sb.countplot(data= df_loan_copy, y= 'Occupation', color = color_pal, order = order)
plt.title("Distribution of Borrowers across Top Occupations",fontdict = {'weight': 'bold', 'fontsize':16})
plt.xlabel("Frequency",fontdict = {'weight': 'bold', 'fontsize':14})
plt.ylabel("Occupation",fontdict = {'weight': 'bold', 'fontsize':14})
Text(0, 0.5, 'Occupation')
def distribution (data, onx, sides_to_remove, title, fig_width, fig_height, rotation,\
sort_as_ordinal = None, rotatex = None, subplot = None):
value = df_loan_copy[data].value_counts().index.to_list() if sort_as_ordinal == None else\
df_loan_copy[data].value_counts().sort_index().index.to_list()
fig = plt.figure(figsize = (fig_width, fig_height))
ax1 = plt.subplot(111) if subplot == None else plt.subplot(subplot)
ax1.set_title(title, fontdict = {'color':'brown', 'fontsize':20,\
'fontweight': 'bold'}, pad = 10)
if onx:
#if the categorical variale should be on x-axis
sb.countplot(x = data, data = df_loan_copy, order = value, color = 'brown', ax = ax1)
for i in range(len(value)):
distribution = df_loan_copy[data].value_counts()[value[i]]
percent = (distribution/df_loan_copy[data].shape[0]) * 100
plt.text(x = i, y = distribution , s = '{:0.0f}% ({c})'.format(percent, c = distribution), rotation = 45 if\
rotation else 0, fontsize = 16, ha ='center') # to print the proportions in percentage
plt.xlabel(f'{data}', fontdict = {'weight': 'bold', 'size' : 16})
plt.ylabel(f'Frequency of {data}', fontdict = {'weight': 'bold', 'size' : 16})
plt.xticks(fontsize=16, rotation = 0 if rotatex == False else 90);
plt.yticks(fontsize=16)
else:
sb.countplot(y = data, data = df_loan_copy, order = value, color = 'brown', ax = ax1)
plt.ylabel(f'{data}', fontdict = {'weight': 'bold', 'size' : 16})
plt.xlabel(f'Frequency of {data}', fontdict = {'weight': 'bold', 'size' : 16})
for i in range(len(value)):
distribution = df_loan_copy[data].value_counts()[value[i]]
percent = (distribution/df_loan_copy[data].shape[0]) * 100
plt.text(y = i, x = distribution , s = '{:0.0f}% ({c})'.format(percent, c = distribution), rotation = 45 if\
rotation else 0, fontsize = 16, ha = 'left')
for i in sides_to_remove:
ax1.spines[i].set_visible(False)
#Distribution of ProsperRating (Alpha)
sides = ['top', 'right']
distribution(data = 'ProsperRating (Alpha)', onx = False, sides_to_remove = sides,\
title ='Distribution of ProsperRating (Alpha)', fig_width = 20, fig_height = 8, rotation = False)
#Distribution of ListingCategory
sides = ['top', 'right']
distribution(data = 'ListingCategory', onx = False, sides_to_remove = sides,\
title ='Distribution of Listing Category', fig_width = 10, fig_height = 10, rotation = False)
#Distribution of Loan Status
sides = ['right', 'top']
distribution(data = 'LoanStatus', onx = True, sides_to_remove = sides, title ='Distribution of Loan Status',\
fig_width = 20, fig_height = 10, rotation = True)
##plot of LoanOriginationQuarter
sides = ['right', 'top']
distribution(data = 'LoanOriginationQuarter', onx = True, sides_to_remove = sides, title ='Distribution across Quater',\
fig_width = 20, fig_height = 10, rotation = True, sort_as_ordinal = True, rotatex = False)
distribution(data = 'Recommendations', onx = False, sides_to_remove = ['top'], title ='Distribution of Recommendations',\
fig_width = 20, fig_height = 10, rotation = False, sort_as_ordinal = True, rotatex = False)
f, ax = plt.subplots(2, 2, figsize = (10,10))
color = ['grey', 'brown']
label = [ 'False','True']
def proportion_pie(data, axis_row, axis_column, title, explode, rotation1, rotate_index, wedgeprops = None,):
'''To plot distributions of categorical data in form of proportion
data: str, The elements of the categorical data,
axis_row: int, row position to plot on,
axis_column: int, column position to plot on,
title: str, title of plot,
explode: list, explode values to pass to pie plot,
rotation1: bool, If all texts should be rotated,
rotate_index: int, The index of the text to rotate if rotation1 is false,
wedgeprops: dict, The width of each proportion, to help form a donut chart
'''
data = df_loan_copy[data].value_counts().sort_index()
axes = ax[axis_row][axis_column]
axes.set_title(title, backgroundcolor = 'brown',\
fontdict = {'fontweight': 'bold', 'color': 'white'}, pad = 71)
explode = explode
if wedgeprops != None:
patches, texts, autotexts = axes.pie(data, colors = color, autopct = '%0.0f%%', startangle = 87,
explode = explode, wedgeprops = wedgeprops,\
textprops = {'color': 'white', 'fontweight' : 'bold','fontsize':16}, radius = 2);
else:
patches, texts, autotexts = axes.pie(data, colors = color, autopct = '%0.0f%%', startangle = 87,
explode = explode, textprops = {'color': 'white', 'fontweight' : 'bold','fontsize':16}, radius = 2);
if rotation1:
[text.set_rotation(0) for text in autotexts]
else:
autotexts[rotate_index].set_rotation(0)
legend_properties = {'weight': 'bold'}
f.legend(label, loc = 4 , facecolor = 'teal',borderpad = 4, labelspacing = 4,\
fontsize = 16, prop = legend_properties, labelcolor = 'white',\
bbox_to_anchor = (0, 0.5));
plt.subplots_adjust(wspace = 0.9, hspace = 0.9)
ax[1][1].set_visible(False)
return axes
#percentage of IsBorrowerHomeowner
proportion_pie('IsBorrowerHomeowner', 0, 0, 'Percentage of Borrowers that are home owners', [0.07, 0],\
True,{'width': 1.0})
#percentage of CurrentlyInGroup
proportion_pie('CurrentlyInGroup', 0, 1, 'Percentage of Borrowers that are in a group', [0, 0.07],\
False, 1)
#percentage of IncomeVerifiable
proportion_pie('IncomeVerifiable', 1, 0, 'Percentage of verifiable Income', [0, 0.07],\
False, 0)
plt.show()
# Histogram showing the distribution of the Borrowers DebtToIncomeRatio
plt.figure(figsize = [12, 8])
bin= np.arange(0, df_loan_copy['DebtToIncomeRatio'].max()+0.05, 0.05)
plt.hist(data= df_loan_copy, x = 'DebtToIncomeRatio', bins= bin)
plt.xlim(0, 1)
plt.xlabel('Debt To Income Ratio')
Text(0.5, 0, 'Debt To Income Ratio')
# using subplots to create Histograms showing the distribution of the Credit Score Upper and lower Range
fig, ax = plt.subplots(nrows=2, figsize = [12,8])
variables = ['CreditScoreRangeUpper', 'CreditScoreRangeLower']
for i in range(len(variables)):
var = variables[i]
bins = np.arange(min(df_loan_copy[var]), max(df_loan_copy[var])+20, 20)
ax[i].hist(data = df_loan_copy, x = var, bins = bins)
ax[i].set_xlabel('{}'.format(var))
ax[i].set_xlim(400, 900)
plt.show()
def hist(fig_width, fig_height, data, number_of_bins, title, more_than_a_subplot = False, \
use_another_data = False, another_data = None,):
plt.figure(figsize = (fig_width, fig_height))
ax = plt.subplot(121)
data_to_use = df_loan_copy
bin_number = data_to_use[data[0]].max() / number_of_bins
bins = np.arange(data_to_use[data[0]].min(), data_to_use[data[0]].max() +bin_number, bin_number)
sb.histplot(x = data[0], data = data_to_use, ax = ax, bins = bins)
ax.spines['top'].set_visible(False)
plt.title(title if isinstance(title, str) else title[0],
fontdict = {'fontsize': 20})
plt.xlabel(f'{data[0]}', fontdict = {'weight': 'bold', 'fontsize':16})
plt.ylabel(f'frequency of {data[0]}', fontdict = {'weight': 'bold','fontsize':16})
if more_than_a_subplot:
ax = plt.subplot(122)
if use_another_data:
data_to_use = another_data
else:
data_to_use = data_copy
bin_number = data_to_use[data[1]].max() / number_of_bins
bins = np.arange(data_to_use[data[1]].min(), data_to_use[data[1]].max() +bin_number, bin_number)
sb.histplot(x = data[1], data = data_to_use, ax = ax, bins = bins)
ax.spines['top'].set_visible(False)
plt.title(title[1], fontdict = {'fontsize': 40})
plt.xlabel(f'{data[1]}', fontdict = {'weight': 'bold','fontsize':16})
plt.ylabel(f'frequency of {data[1]}', fontdict = {'weight': 'bold','fontsize':16})
plt.subplots_adjust(wspace = 0.9)
plt.tight_layout()
# Histogram showing the distribution of MonthlyLoanPayment
hist(fig_width = 20, fig_height = 6, data = ['MonthlyLoanPayment'],\
number_of_bins = 27, title = 'Distribution of MonthlyLoanPayment',\
)
# Histogram showing the distribution of Lender Yield
hist(fig_width = 20, fig_height = 6, data = ['LenderYield'],\
number_of_bins = 27, title = 'Distribution of LenderYield',\
)
# Histogram showing the distribution of TotalCreditLinespast7years
hist(fig_width = 20, fig_height = 5, data = ['TotalCreditLinespast7years'],\
number_of_bins = 27, title = 'Distribution of Total Credit Lines in past 7 years',\
)
plt.figure(figsize = (20,8))
plt.subplot(121)
sb.kdeplot(data = df_loan_copy['PercentFunded'], color = 'brown')
plt.title('Distribution of all PercentFunded', fontdict = {'fontsize': 18, 'color': 'black'}, pad = 0.4)
plt.xlabel('Percent Funded', fontdict = {'weight': 'bold'})
plt.ylabel('counts', fontdict = {'weight': 'bold'})
plt.ylim(-4);
#Because the proportion of 1.0 is very large, I want to view the distribution of other percent funded in another subplot
plt.subplot(122)
not_including_1 = df_loan_copy[df_loan_copy['PercentFunded'] < 1]
bins = np.arange(not_including_1['PercentFunded'].min(), not_including_1['PercentFunded'].max() + 0.025, 0.025)
sb.histplot(data = not_including_1, x = 'PercentFunded', bins = bins, color = 'aqua')
plt.title('Distribution of PercentFunded excluding 1.0', fontdict = {'fontsize': 18, 'color': 'black'}, pad = 0.4)
plt.xlim(0.7,0.99,0.25)
plt.xlabel('Percent Funded', fontdict = {'weight': 'bold'})
plt.ylabel('Density', fontdict = {'weight': 'bold'})
plt.subplots_adjust(wspace = 1.0)
#check the percentage of loans that has PercentFunded of 1
(df_loan_copy['PercentFunded'].value_counts().iloc[0]/ df_loan_copy['PercentFunded'].shape[0]) * 100
99.2199772007145
PercentFunded feature is highly skewed with 99.18% of the data being 1.0, thus performing transformation on the feature had no effect on the skewness. But from the observation, 99.18 percent of loans were fully funded.
Of the 0.82% left, most of the loans were 70% funded#check for columns that have high positive correlation
a = df_loan_copy.corr()[df_loan_copy.corr() > 0.4]
sb.heatmap(a)
plt.title('Positive correlation between features greater than 0.4', fontdict = {'weight': 'bold'})
Text(0.5, 1.0, 'Positive correlation between features greater than 0.4')
#check for columns that have high negative correlation
a = df_loan_copy.corr()[df_loan_copy.corr() < -0.4]
sb.heatmap(a)
plt.title('Negative correlation between features less than 0.4', fontdict = {'weight': 'bold'})
Text(0.5, 1.0, 'Negative correlation between features less than 0.4')
#create a pairplot to show correlation with percentFunded
plt.figure(figsize = (10,10))
g = sb.pairplot(data = df_loan_copy, y_vars = 'PercentFunded')
plt.title('Pairplot of other features against Percent Funded to confirm', fontdict ={'weight': 'bold'})
Text(0.5, 1.0, 'Pairplot of other features against Percent Funded to confirm')
<Figure size 720x720 with 0 Axes>
def boxplots(fig_width, fig_height, cat_data, num_data, data, title,
multiple_subplots = False, use_another_data = False, another_data = None):
plt.figure(figsize = (fig_width, fig_height))
plt.subplot(211)
sb.boxplot(x = cat_data[0], y = num_data[0], data = df_loan_copy, color = 'teal')
plt.xticks(rotation = 90)
plt.title(title[0], fontdict = {'fontweight': 'bold', 'fontsize' : 24})
if multiple_subplots: #If there should be another subplot
plt.subplot(212)
if use_another_data:#If another data should be used
data_to_use = another_data
else:
data_to_use = data
sb.boxplot(x = cat_data[0], y = num_data[0], data = data_to_use, color = 'teal')
plt.title(title[1], fontdict = {'fontweight': 'bold', 'fontsize' : 24})
plt.xticks(rotation = 90);
plt.subplots_adjust(hspace = 1.4)
plt.xlabel(f'{cat_data[0]}', fontdict = {'weight': 'bold'})
plt.ylabel(f'{num_data[0]}', fontdict = {'weight': 'bold'})
def full_funding(cat_data, title, fig_size = None):
if fig_size != None:
plt.figure(figsize = fig_size)
full_funding = df_loan_copy[df_loan_copy['PercentFunded'] == 1]# data of 100% funding
sb.countplot(x = cat_data, data = full_funding, color = 'brown')
plt.xticks(rotation = 90);
plt.title(f'Counts of {title} that had full percent funding',
fontdict = {'fontweight': 'bold', 'fontsize' : 24});
plt.xlabel(f'{cat_data}', fontdict = {'weight': 'bold'})
plt.ylabel(f'Counts of {cat_data[0]} for loans with full funding', fontdict = {'weight': 'bold'})
#Relationship between percentFunded and Loanstatus
not_including_1 = df_loan_copy[df_loan_copy['PercentFunded'] < 1]
boxplots(10, 10, ['LoanStatus'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and Loanstatus',
'Relationship between percentFunded and Loanstatus excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and Loanstatus for fully funded loans
full_funding('LoanStatus', 'loan status')
#Relationship between percentFunded and Loanstatus ProsperRating (Alpha)
not_including_1 = df_loan_copy[df_loan_copy['PercentFunded'] < 1]
boxplots(10, 10, ['ProsperRating (Alpha)'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and ProsperRating',
'Relationship between percentFunded and ProsperRating excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and ProsperRating (Alpha) for fully funded loans
full_funding('ProsperRating (Alpha)', 'ProsperRating')
#Relationship between percentFunded and ListingCategory
not_including_1 = df_loan_copy[df_loan_copy['PercentFunded'] < 1]
boxplots(10, 10, ['ListingCategory'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and ListingCategory',
'Relationship between percentFunded and ListingCategory excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and ListingCategory for fully funded loans
full_funding('ListingCategory', 'Listing Category', (10,4))
#Relationship between percentFunded and BorrowerState
not_including_1 = df_loan_copy[df_loan_copy['PercentFunded'] < 1]
boxplots(20, 10, ['BorrowerState'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and Borrower State',
'Relationship between percentFunded and Borrower State excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and BorrowerState for fully funded loans
full_funding('BorrowerState', 'States', (15, 5))
#Relationship between percentFunded and Occupation
boxplots(20, 10, ['Occupation'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and Occupation',
'Relationship between percentFunded and Occupation excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and Occupation for fully funded loans
full_funding('Occupation', 'Occupation', (15,8))
#Relationship between percentFunded and EmploymentStatus
boxplots(20, 10, ['EmploymentStatus'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and Employment Status',
'Relationship between percentFunded and Employment Status excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and EmploymentStatus for fully funded loans
full_funding('EmploymentStatus', 'EmploymentStatus')
#Relationship between percentFunded and IsBorrowerHomeowner
boxplots(20, 10, ['IsBorrowerHomeowner'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and IsBorrowerHomeowner',
'Relationship between percentFunded and IsBorrowerHomeowner excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and IsBorrowerHomeowner for fully funded loans
full_funding('IsBorrowerHomeowner', 'Homeowner')
#Relationship between percentFunded and CurrentlyInGroup
boxplots(20, 10, ['CurrentlyInGroup'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and CurrentlyInGroup',
'Relationship between percentFunded and CurrentlyInGroup excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and CurrentlyInGroup for fully funded loans
full_funding('CurrentlyInGroup', 'borrowers in groups')
#Relationship between percentFunded and LoanOriginationQuarter
boxplots(20, 10, ['LoanOriginationQuarter'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and LoanOriginationQuarter',
'Relationship between percentFunded and LoanOriginationQuarter excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and LoanOriginationQuarter for fully funded loans
full_funding('LoanOriginationQuarter', 'Quarter that had loans')
#Relationship between percentFunded and IncomeVerifiable
boxplots(20, 10, ['IncomeVerifiable'], ['PercentFunded'], df_loan_copy,
['Relationship between percentFunded and IncomeVerifiable',
'Relationship between percentFunded and IncomeVerifiable excluding 1'],
multiple_subplots = True, \
use_another_data = True, another_data = not_including_1)
#Relationship between percentFunded and IncomeVerifiable for fully funded loans
full_funding('IncomeVerifiable', 'verified income')
def heatmap(x_data, y_data, number_of_bins, fig_size,more_subplot = False, x_data2 = None):
plt.figure(figsize = fig_size)
plt.subplot(211) if more_subplot else plt.subplot(111)
no_of_bins_x = not_including_1[x_data].max()/ number_of_bins
no_of_bins_y = not_including_1[y_data].max()/ number_of_bins
bins_x = np.arange(not_including_1[x_data].min(),
not_including_1[x_data].max() + no_of_bins_x, no_of_bins_x)
bins_y = np.arange(not_including_1[y_data].min(),
not_including_1[y_data].max() + no_of_bins_y, no_of_bins_y)
plt.hist2d(x = x_data, y = y_data, data = not_including_1,
bins = [bins_x, bins_y], cmap = 'viridis_r', )
plt.title(f'Relationship between percentFunded and {x_data}')
plt.xlabel(f'{x_data}', fontdict = {'weight': 'bold'})
plt.ylabel(f'{y_data}', fontdict = {'weight': 'bold'})
if more_subplot:
plt.subplot(212)
no_of_bins_x = not_including_1[x_data2].max()/ number_of_bins
bins_x = np.arange(not_including_1[x_data2].min(),
not_including_1[x_data2].max() + no_of_bins_x, no_of_bins_x)
plt.hist2d(x = x_data2, y = y_data, data = not_including_1,
bins = [bins_x, bins_y], cmap = 'viridis_r', )
plt.title(f'Relationship between percentFunded and {x_data2}')
plt.xlabel(f'{x_data2}', fontdict = {'weight': 'bold'})
plt.ylabel(f'{y_data}', fontdict = {'weight': 'bold'})
plt.subplots_adjust(hspace = 0.9)
plt.colorbar();
def contvcont(x_data, y_data, number_of_bins, fig_size,more_subplot = False, x_data2 = None):
plt.figure(figsize = fig_size)
plt.subplot(211) if more_subplot else plt.subplot(111)
no_of_bins_x = not_including_1[x_data].max()/ number_of_bins
no_of_bins_y = not_including_1[y_data].max()/ number_of_bins
bins_x = np.arange(not_including_1[x_data].min(),
not_including_1[x_data].max() + no_of_bins_x, no_of_bins_x)
bins_y = np.arange(not_including_1[y_data].min(),
not_including_1[y_data].max() + no_of_bins_y, no_of_bins_y)
sb.regplot(x = x_data, y = y_data, data = not_including_1,)
plt.title(f'Relationship between percentFunded and {x_data}')
plt.xlabel(f'{x_data}', fontdict = {'weight': 'bold'})
plt.ylabel(f'{y_data}', fontdict = {'weight': 'bold'})
if more_subplot:
plt.subplot(212)
no_of_bins_x = not_including_1[x_data2].max()/ number_of_bins
bins_x = np.arange(not_including_1[x_data2].min(),
not_including_1[x_data2].max() + no_of_bins_x, no_of_bins_x)
sb.regplot(x = x_data, y = y_data, data = not_including_1)
plt.title(f'Relationship between percentFunded and {x_data2}')
plt.xlabel(f'{x_data}', fontdict = {'weight': 'bold'})
plt.ylabel(f'{y_data}', fontdict = {'weight': 'bold'})
plt.subplots_adjust(hspace = 0.9)
def full_fund_hist(data, number_of_bins, fig_size,more_subplot = False, data2 = None, log_plot = False):
full_funding = df_loan_copy[df_loan_copy['PercentFunded'] == 1]
plt.figure(figsize = fig_size)
plt.subplot(121) if more_subplot else plt.subplot(111)
no_of_bins = full_funding[data].max()/ number_of_bins
bins = np.arange(full_funding[data].min(),
full_funding[data].max() + no_of_bins, no_of_bins)
sb.histplot(x = data, data = full_funding, bins = bins, stat = 'density')
plt.title(f'{data} with full percentFunded')
plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
plt.ylabel(f'frequency of {data} for fully funded loans', fontdict = {'weight': 'bold'})
if more_subplot:
plt.subplot(122)
if log_plot:
no_of_bins = np.log(full_funding[data] + 1).max()/ number_of_bins
bins = np.arange(np.log(full_funding[data] + 1).min(),
np.log(full_funding[data] + 1).max() + no_of_bins, no_of_bins)
plt.xscale('log')
plt.xlim(0.9e0, 2e1)
plt.xticks([0.9e0, 2e0, 4e0, 6e0, 8e0, 1e1], ['0', '2', '4', '6', '8', '10'])
else:
no_of_bins = full_funding[data2].max()/ number_of_bins
bins = np.arange(full_funding[data2].min(),
full_funding[data2].max() + no_of_bins, no_of_bins)
sb.histplot(x = data if log_plot else data2
, data = full_funding, bins = bins, stat = 'density')
plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
plt.ylabel(f'frequency of {data} for fully funded loans', fontdict = {'weight': 'bold'})
if log_plot:
plt.title(f'logarithm plot of {data} with full percentFunded')
else:
plt.title(f'logarithm plot of {data2} with full percentFunded')
plt.subplots_adjust(wspace = 0.9)
contvcont('CreditScoreRangeLower', 'PercentFunded', 50, (10,10),more_subplot = True, x_data2 = 'CreditScoreRangeUpper')
full_fund_hist('CreditScoreRangeLower', 50, (10,10),more_subplot = True, data2 = 'CreditScoreRangeUpper')
heatmap('TotalCreditLinespast7years', 'PercentFunded', 50, (15, 6))
full_fund_hist('TotalCreditLinespast7years', 50, (10,6))
full_fund_hist('DebtToIncomeRatio', 50, (10,6), more_subplot = True, log_plot = True)
contvcont('DebtToIncomeRatio', 'PercentFunded', 50, (10,8),\
more_subplot = True, x_data2 = 'CreditScoreRangeUpper')
def correlation(fig_size, data, multiple_subplot = False):
plt.figure(figsize = fig_size)
plt.subplot(121) if multiple_subplot else plt.subplots(111)
sb.regplot(x = data, y = 'PercentFunded', data = df_loan_copy)
plt.title(f'Relationship between PercentFunded \nand {data}')
plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
plt.ylabel(f'PercentFunded', fontdict = {'weight': 'bold'})
if multiple_subplot:
plt.subplot(122)
sb.regplot(x = data, y = 'PercentFunded', data = not_including_1)
plt.title(f'Relationship between PercentFunded \nand {data} of loans'
'that were not fully funded')
plt.subplots_adjust(wspace = 0.9)
plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
plt.ylabel(f'PercentFunded', fontdict = {'weight': 'bold'})
def fully_funded_for_numeric(fig_size, column):
plt.figure(figsize = fig_size)
plt.subplot(111)
data_to_use = df_loan_copy[df_loan_copy['PercentFunded'] == 1]
no_of_bins = data_to_use[column].max()/ 50
bins = np.arange(data_to_use[column].min(),
data_to_use[column].max() + no_of_bins, no_of_bins)
sb.histplot(x = column, data = data_to_use, bins = bins)
plt.xlabel(f'{column}', fontdict = {'weight': 'bold'})
plt.ylabel(f'Frequency of {column} for fully funded loans', fontdict = {'weight': 'bold'})
fully_funded_for_numeric((10,4), 'DebtToIncomeRatio')
correlation((10, 8), 'MonthlyLoanPayment', multiple_subplot = True)
correlation((10, 8), 'Recommendations', multiple_subplot = True)
def recommendations(column):
'''To plot distribution of numeric data on recommendations for loans not fully funded
column: str, The numeric data to plot,
'''
fig, ax = plt.subplots(1,2)
reco = not_including_1['Recommendations'].unique()
for i, ax in zip(reco, ax.ravel()):
no_recommendation = not_including_1[not_including_1['Recommendations'] == i]
sb.boxplot(x = 'Recommendations', y = column, data = no_recommendation, ax = ax)
plt.text(x = 1, y = 0.8, s = 'The only number of recommendation of loans that do not have '
'full funding are 0 and 1')
ax.set_xlabel('Recommendation', fontdict = {'weight': 'bold'})
ax.set_ylabel(f'PercentFunded', fontdict = {'weight': 'bold'})
plt.title(f'Relationship between Recomendations and {column} of not fully funded loans',
fontdict = {'weight':'bold'})
plt.subplots_adjust(wspace = 0.9)
recommendations('PercentFunded')
recommendations('InvestmentFromFriendsCount')
full_funded = df_loan_copy[df_loan_copy['PercentFunded'] == 1]
ax = plt.subplot(111)
prop = full_funded['Recommendations'].value_counts() / full_funded['Recommendations'].shape[0]
label = [str(x) for x in full_funded['Recommendations'].value_counts().index.to_list()]
sb.barplot(x = prop, y = label, ax = ax)
for i in range(len(label)):
prop = full_funded['Recommendations'].value_counts().iloc[i] / \
full_funded['Recommendations'].shape[0]
percent = prop * 100
value = full_funded['Recommendations'].value_counts().iloc[i]
plt.text(y = i, x = prop, s = '{:0.0f}% ,{}'.format(percent, value))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('Proportion of Recommendations for fully funded loans', fontdict = {'Weight': 'bold'},
pad = 7)
plt.xlabel('proportion', fontdict = {'weight':'bold'})
plt.ylabel('Recommendations', fontdict = {'weight':'bold'});
correlation((10, 8), 'InvestmentFromFriendsAmount', multiple_subplot = True)
def relation(x_data, y_data, data):
'''To plot relationship between numeric data
x_data: str, The numeric data to plot on x-axis,
y_data: str, The numeric data to plot on y-axis,
data: DataFrame, The data to plot from
'''
sb.regplot(x = x_data, y = y_data, data = data);
plt.title(f'relationship between {x_data} and {y_data}', fontdict = {'weight':'bold'}, pad = 7)
plt.xlabel(f'{x_data}', fontdict = {'weight':'bold'})
plt.ylabel(f'{y_data}', fontdict = {'weight':'bold'});
relation('InvestmentFromFriendsCount', 'Investors', df_loan_copy)
relation('Recommendations', 'InvestmentFromFriendsCount', df_loan_copy)
relation('LenderYield', 'PercentFunded', not_including_1)
def cat_rel(column, hue):
'''To plot counts of two categorical varable
x_data: str, The numeric data to plot on x-axis,
y_data: str, The numeric data to plot on y-axis,
data: DataFrame, The data to plot from
'''
sb.countplot(x = column, hue = hue, data = df_loan_copy)
plt.xticks(rotation = 90)
plt.title(f'Proportion of {column} and {hue}', fontdict = {'weight':'bold'}, pad = 7)
plt.xlabel(f'{column}', fontdict = {'weight':'bold'})
plt.ylabel(f'counts of {column} and {hue}', fontdict = {'weight':'bold'});
cat_rel('EmploymentStatus', 'IncomeVerifiable',)
cat_rel('ProsperRating (Alpha)', 'CurrentlyInGroup',)
def vari(col = None, numerical = False, facet = False, minimum = None, pad = None):
'''To plot relationship between LenderYield, PercentFunded and one categorical variable or
numerical variable
col: str, categorical variable to plot,
numerical: str, numerical variable to plot,
facet: bool, if it should be facet plot,
minimum: int, minimum value for colorbar,
pad = The pad for title
'''
plt.figure(figsize = (10,5))
if numerical:
plt.scatter(data = not_including_1, x = 'LenderYield', y = 'PercentFunded', c = col, \
cmap = 'Accent', vmin = minimum)
plt.colorbar(label = col)
elif facet:
g = sb.FacetGrid(data = not_including_1, col = col, col_wrap =3)
g.map(sb.regplot, 'LenderYield', 'PercentFunded')
plt.title(f'Relationship between LenderYield, PercentFunded and {col}',
fontdict = {'weight': 'bold'}, pad = pad)
else:
sb.scatterplot(x = 'LenderYield', y = 'PercentFunded', data = not_including_1,\
hue = col, s = 70);
if not facet:
plt.title(f'Relationship between LenderYield, PercentFunded and {col}',
fontdict = {'weight': 'bold'}, pad = 10)
plt.xlabel('LenderYield', fontdict = {'weight':'bold'})
plt.ylabel('PercentFunded', fontdict = {'weight':'bold'});
plt.legend(loc = 'right', bbox_to_anchor = (1.5,1))
vari(col = 'EmploymentStatus')
vari('LoanStatus', facet = True, pad = 650)
No handles with labels found to put in legend.
<Figure size 720x360 with 0 Axes>
vari('ProsperRating (Alpha)')
vari('CurrentlyInGroup')
vari(col = 'LoanOriginationQuarter', facet= True, pad = 250)
No handles with labels found to put in legend.
<Figure size 720x360 with 0 Axes>
vari(col = 'IncomeVerifiable')
vari(numerical= True, col= 'TotalCreditLinespast7years')
vari(numerical= True, col= 'CreditScoreRangeLower')
vari(numerical= True, col = 'CreditScoreRangeUpper', minimum = 19)
vari(numerical= True, col = 'MonthlyLoanPayment')
Exploring the distribution of univariables of interest, one of the insightful information discovered was that CA State has the highest borrowers. Looking at Borrower's employment status, I discovered majority of the borrowers have either employed or fulltime employment status. Investigating further into their source of income, I discovered that their income mostly ranges from 25,000-74,999 and their monthly income distribution is skewed to the right and they are usually less than 30k. Their income ratio is right skewed as well. Further into Bivariance exploration,to observe the relationships between 2 variables each of the data, I discovered that there was a high correlation between 'Recommendations'and 'friends that invest' with Little to no correlation between LenderYield and PercentFunded Lastly on Multivariant exploration,